1 Kaggle Kernel Analysis: NYC Taxi EDA

Kernel name: “NYC Taxi EDA - Update: The fast & the curious”
Link: https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious

1.1 Introduction

In this project we are going to explore the kernel above showing why the author choose the functions used, comparing to others alternatives and we are going to emphasize the interesting points of the kernel.

1.2 Load data

1.2.1 Original code

library("data.table")
library("tibble")

train <- as.tibble(fread('train.csv'))

1.2.2 Comparison with others alternatives

In R, read.csv is part of the regular functions and is used for load data.frame from a csv file. But when we’re dealing with a huge data.frame this function can take a long time to run.

print(paste("In this case the dataset is quite huge:",dim(train)[1], "rows and",
            dim(train)[2], "columns."))
## [1] "In this case the dataset is quite huge: 1458644 rows and 11 columns."

So in this part the author used a function called fread that performs much faster than read.csv (check the time of each function using profvis!!).
After that other function should be compared: load. This function is used to load variables that have been stored in a .RData file and runs very fast comparing with read.csv and fread.
When is a good ideia to use load? When it’s possible to use a background process to update the data.frame and save it in .RData file.
Let’s take a look at the three possibilities:

library("profvis")
library("data.table")
library("tibble")
library("readr")

profvis({
  # fread
  train <- fread("train.csv")
  test <- fread("test.csv")
  
  # read.csv
  train_readcsv <- read.csv("train.csv")
  
  # read_csv -> from "readr" package
  train_read_csv <- read_csv("train.csv")
  
  # as.tibble
  train <- as.tibble(train)
  test <- as.tibble(test)
  
  # loading RData
  save(train_readcsv, file = "train_data.RData")
  rm(train_readcsv)
  load(file = "train_data.RData")
})

1.2.3 Tibbles vs data frames

All the information bellow was “greped” from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
Tibbles
“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).”

Major points:

  • It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!);
  • It never adjusts the names of variables (i.e names with spaces will keep the whitespace. Data.frame replaces whitespace for ‘.’);
  • When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen;
  • Tibbles are quite strict about subsetting. [ ] always returns another tibble. Contrast this with a data frame: sometimes [ ] returns a data frame and sometimes it just returns a vector.

1.3 File structure and content

A brief overview of our data can summaries the descriptive statistics values of the dataset and detect abnormal items or outliers.

For the summaries

summary(train)
##       id              vendor_id     pickup_datetime    dropoff_datetime  
##  Length:1458644     Min.   :1.000   Length:1458644     Length:1458644    
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :1.535                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##  passenger_count pickup_longitude  pickup_latitude dropoff_longitude
##  Min.   :0.000   Min.   :-121.93   Min.   :34.36   Min.   :-121.93  
##  1st Qu.:1.000   1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99  
##  Median :1.000   Median : -73.98   Median :40.75   Median : -73.98  
##  Mean   :1.665   Mean   : -73.97   Mean   :40.75   Mean   : -73.97  
##  3rd Qu.:2.000   3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96  
##  Max.   :9.000   Max.   : -61.34   Max.   :51.88   Max.   : -61.34  
##  dropoff_latitude store_and_fwd_flag trip_duration    
##  Min.   :32.18    Length:1458644     Min.   :      1  
##  1st Qu.:40.74    Class :character   1st Qu.:    397  
##  Median :40.75    Mode  :character   Median :    662  
##  Mean   :40.75                       Mean   :    959  
##  3rd Qu.:40.77                       3rd Qu.:   1075  
##  Max.   :43.92                       Max.   :3526282
summary(test)
##       id              vendor_id     pickup_datetime    passenger_count
##  Length:625134      Min.   :1.000   Length:625134      Min.   :0.000  
##  Class :character   1st Qu.:1.000   Class :character   1st Qu.:1.000  
##  Mode  :character   Median :2.000   Mode  :character   Median :1.000  
##                     Mean   :1.535                      Mean   :1.662  
##                     3rd Qu.:2.000                      3rd Qu.:2.000  
##                     Max.   :2.000                      Max.   :9.000  
##  pickup_longitude  pickup_latitude dropoff_longitude dropoff_latitude
##  Min.   :-121.93   Min.   :37.39   Min.   :-121.93   Min.   :36.60   
##  1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99   1st Qu.:40.74   
##  Median : -73.98   Median :40.75   Median : -73.98   Median :40.75   
##  Mean   : -73.97   Mean   :40.75   Mean   : -73.97   Mean   :40.75   
##  3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96   3rd Qu.:40.77   
##  Max.   : -69.25   Max.   :42.81   Max.   : -67.50   Max.   :48.86   
##  store_and_fwd_flag
##  Length:625134     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Data overview

library("dplyr")
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime   <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
glimpse(test)
## Observations: 625,134
## Variables: 9
## $ id                 <chr> "id3004672", "id3505355", "id1217141", "id2...
## $ vendor_id          <int> 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1...
## $ pickup_datetime    <chr> "2016-06-30 23:59:58", "2016-06-30 23:59:53...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 4, 1, 1, 1, 1...
## $ pickup_longitude   <dbl> -73.98813, -73.96420, -73.99744, -73.95607,...
## $ pickup_latitude    <dbl> 40.73203, 40.67999, 40.73758, 40.77190, 40....
## $ dropoff_longitude  <dbl> -73.99017, -73.95981, -73.98616, -73.98643,...
## $ dropoff_latitude   <dbl> 40.75668, 40.65540, 40.72952, 40.73047, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...

1.3.1 Comparison with others alternatives

Another popular way to make a data overview is using str. It is very similar to glimpse but str shows less data.

str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1458644 obs. of  11 variables:
##  $ id                : chr  "id2875421" "id2377394" "id3858529" "id3504673" ...
##  $ vendor_id         : int  2 1 2 2 2 2 1 2 1 2 ...
##  $ pickup_datetime   : chr  "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
##  $ dropoff_datetime  : chr  "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
##  $ passenger_count   : int  1 1 1 1 1 6 4 1 1 1 ...
##  $ pickup_longitude  : num  -74 -74 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.7 40.8 40.7 40.8 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
##  $ trip_duration     : int  455 663 2124 429 435 443 341 1551 255 1225 ...
##  - attr(*, ".internal.selfref")=<externalptr>

1.3.2 First observations

  • vendor_id only takes the values 1 or 2, presumably to differentiate two taxi companies
    We can easily check this doing:
levels(as.factor(train$vendor_id))
## [1] "1" "2"
  • pickup_datetime and (in the training set) dropoff_datetime are combinations of date and time that we will have to re-format into a more useful shape
  • passenger_count takes a median of 1 and a maximum of 9 in both data sets
  • The pickup/dropoff_longitute/latitute describes the geographical coordinates where the meter was activate/deactivated
  • store_and_fwd_flag is a flag that indicates whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”). Maybe there could be a correlation with certain geographical areas with bad reception?
  • trip_duration: our target feature in the training data is measured in seconds.

1.4 Missing values

To avoid an inappropriate analysis of the data, the missing values should be analysed to measure their impact in the whole dataset.
If the number of cases is less than 5% of the sample, then the researcher can drop them.
For more info about this subject: https://www.statisticssolutions.com/missing-values-in-data/
Luckly there is no missing values in data (easy mode):

sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0

1.5 Combining train and test

Here the author did an interesting move: he combined train and test data sets into a single one in order to avoid a closely approach that matches just one part of data.
CAUTION: we can only combine the two data sets for a better overview but for the creation of a machine learning model we should keep train and test separate.

# Mutate creates dset, dropff_datetime and trip_duration columns for test dataset
# For train dataset only dset is created by mutate
# bind_rows combines the data sets into one
combine <- bind_rows(train %>% mutate(dset = "train"), 
                     test %>% mutate(dset = "test",
                                     dropoff_datetime = NA,
                                     trip_duration = NA))
combine <- combine %>% mutate(dset = factor(dset))
glimpse(combine)
## Observations: 2,083,778
## Variables: 12
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime   <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dset               <fct> train, train, train, train, train, train, t...

1.6 Reformating features

For our following analysis, we will turn the data and time from characters into date objects. We also recode vendor_id as a factor. This makes it easier to visualise relationships that involve these features.

library('lubridate')
train <- train %>%
  mutate(pickup_datetime = ymd_hms(pickup_datetime),
         dropoff_datetime = ymd_hms(dropoff_datetime),
         vendor_id = factor(vendor_id),
         passenger_count = factor(passenger_count))

1.7 Consistency check

ASSUME NOTHING! It is worth checking whether the trip_durations are consistent with the intervals between the pickup_datetime and dropoff_datetime. Presumably the former were directly computed from the latter, but you never know. Below, the check variable shows “TRUE” if the two intervals are not consistent:

train %>%
  mutate(check = abs(int_length(interval(dropoff_datetime,pickup_datetime)) + trip_duration) > 0) %>%
  select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
  group_by(check) %>%
  count()
## # A tibble: 1 x 2
## # Groups:   check [1]
##   check       n
##   <lgl>   <int>
## 1 FALSE 1458644

And we find that everything fits perfectly.

2 Individual feature visualizations

“Visualizations of feature distributions and their relations are key to understanding a data set, and they often open up new lines of inquiry. I always recommend to examine the data from as many different perspectives as possible to notice even subtle trends and correlations.”

2.1 Leaflet package for interative maps

The author used a wonderful package for interative maps called leaflet. There a couple of models that you can try. We changed a little bit to try the leaflet-extras providers.

library(leaflet)
library(leaflet.extras)
set.seed(1234)
foo <- sample_n(train, 8e3)

  leaflet(data = foo) %>% addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "blue", fillOpacity = 0.3)

Using this visualization feature we notice the majority points of our trips (Manhattan and JFK airport).

Trip Duration

library(ggplot2)

train %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(fill = "red", bins = 150) +
  scale_x_log10() +
  scale_y_sqrt() +
  labs(title = "New York Taxi - EDA", x = "Trip Duration (s)", y = "Number of Events")

We find:

  • the majority of rides follow a rather smooth distribution that looks almost log-normal with a peak just short of 1000 seconds, i.e. about 17 minutes.
  • There are several suspiciously short rides with less than 10 seconds duration.
  • Additionally, there is a strange delta-shaped peak of trip_duration just before the 1e5 seconds mark and even a few way above it:
train %>%
  arrange(desc(trip_duration)) %>%
  select(trip_duration, pickup_datetime, dropoff_datetime, everything()) %>%
  head(10)
## # A tibble: 10 x 11
##    trip_duration pickup_datetime     dropoff_datetime    id      vendor_id
##            <int> <dttm>              <dttm>              <chr>   <fct>    
##  1       3526282 2016-02-13 22:46:52 2016-03-25 18:18:14 id0053~ 1        
##  2       2227612 2016-01-05 06:14:15 2016-01-31 01:01:07 id1325~ 1        
##  3       2049578 2016-02-13 22:38:00 2016-03-08 15:57:38 id0369~ 1        
##  4       1939736 2016-01-05 00:19:42 2016-01-27 11:08:38 id1864~ 1        
##  5         86392 2016-02-15 23:18:06 2016-02-16 23:17:58 id1942~ 2        
##  6         86391 2016-05-31 13:00:39 2016-06-01 13:00:30 id0593~ 2        
##  7         86390 2016-05-06 00:00:10 2016-05-07 00:00:00 id0953~ 2        
##  8         86387 2016-06-30 16:37:52 2016-07-01 16:37:39 id2837~ 2        
##  9         86385 2016-06-23 16:01:45 2016-06-24 16:01:30 id1358~ 2        
## 10         86379 2016-05-17 22:22:56 2016-05-18 22:22:35 id2589~ 2        
## # ... with 6 more variables: passenger_count <fct>,
## #   pickup_longitude <dbl>, pickup_latitude <dbl>,
## #   dropoff_longitude <dbl>, dropoff_latitude <dbl>,
## #   store_and_fwd_flag <chr>

Those records would correspond to 24-hour trips and beyond, with a maximum of almost 12 days. I know that rush hour can be bad, but those values are a little unbelievable.

Over the year, the distributions of pickup_datetime and dropoff_datetime look like this: mark and even a few way above it:

p1 <- train %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(fill = "red", bins = 120) +
  labs(x = "Pickup dates")

p2 <- train %>%
  ggplot(aes(dropoff_datetime)) +
  geom_histogram(fill = "blue", bins = 120) +
  labs(x = "Dropoff dates")

layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1, p2, layout=layout)

Fairly homogeneous, covering half a year between January and July 2016. There is an interesting drop around late January early February:

train %>%
  filter(pickup_datetime > ymd("2016-01-20") & pickup_datetime < ymd("2016-02-10")) %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(fill = "red", bins = 120)

2.2 Raised questions from pickup_datetime data visualization

That’s winter in NYC, so maybe snow storms or other heavy weather? Events like this should be taken into account, maybe through some handy external data set?
In the plot above we can already see some daily and weekly modulations in the number of trips. Let’s investigate these variations together with the distributions of passenger_count and vendor_id by creating a multi-plot panel with different components:

p1 <- train %>%
  group_by(passenger_count) %>%
  count() %>%
  ggplot(aes(passenger_count, n, fill = passenger_count)) +
  geom_col() +
  scale_y_sqrt() +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(vendor_id, fill = vendor_id)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(store_and_fwd_flag)) +
  geom_bar() +
  theme(legend.position = "none") +
  scale_y_log10()

p4 <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(wday, vendor_id) %>%
  count() %>%
  ggplot(aes(wday, n, colour = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Day of the week", y = "Total number of pickups") +
  theme(legend.position = "none")

p5 <- train %>%
  mutate(hpick = hour(pickup_datetime)) %>%
  group_by(hpick, vendor_id) %>%
  count() %>%
  ggplot(aes(hpick, n, color = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Hour of the day", y = "Total number of pickups") +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4,5,5),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)

We find:

  • There are a few trips with zero, or seven to nine passengers but they are a rare exception:
train %>%
  group_by(passenger_count) %>%
  count()
## # A tibble: 10 x 2
## # Groups:   passenger_count [10]
##    passenger_count       n
##    <fct>             <int>
##  1 0                    60
##  2 1               1033540
##  3 2                210318
##  4 3                 59896
##  5 4                 28404
##  6 5                 78088
##  7 6                 48333
##  8 7                     3
##  9 8                     1
## 10 9                     1
  • The vast majority of rides had only a single passenger, with two passengers being the (distant) second most popular option.
  • Towards larger passenger numbers we are seeing a smooth decline through 3 to 4, until the larger crowds (and larger cars) give us another peak at 5 to 6 passengers.
  • Vendor 2 has significantly more trips in this data set than vendor 1 (note the logarithmic y-axis). This is true for every day of the week.
  • We find an interesting pattern with Monday being the quietest day and Friday very busy. This is the same for the two different vendors, with vendor_id == 2 showing significantly higher trip numbers.
  • As one would intuitively expect, there is a strong dip during the early morning hours. There we also see not much difference between the two vendors. We find another dip around 4pm and then the numbers increase towards the evening.
  • The store_and_fwd_flag values, indicating whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”), show that there was almost no storing taking place (note again the logarithmic y-axis):
train %>%
  group_by(store_and_fwd_flag) %>%
  count()
## # A tibble: 2 x 2
## # Groups:   store_and_fwd_flag [2]
##   store_and_fwd_flag       n
##   <chr>                <int>
## 1 N                  1450599
## 2 Y                     8045
y_count <- table(train$store_and_fwd_flag)['Y']/sum(table(train$store_and_fwd_flag))
paste0('Trip data stored in memory due to no connection represents ',round(y_count*100, digits = 2),'% of the values.')
## [1] "Trip data stored in memory due to no connection represents 0.55% of the values."

2.3 Time series graphics

The trip volume per hour of the day depends somewhat on the month and strongly on the day of the week:

p1 <- train %>%
  mutate(hpick = hour(pickup_datetime),
         Month = factor(month(pickup_datetime, label = TRUE))) %>%
  group_by(hpick, Month) %>%
  count() %>%
  ggplot(aes(hpick, n, color = Month)) +
  geom_line(size = 1.5) +
  labs(x = "Hour of the day", y = "count")

p2 <- train %>%
  mutate(hpick = hour(pickup_datetime),
         wday = factor(wday(pickup_datetime, label = TRUE))) %>%
  group_by(hpick, wday) %>%
  count() %>%
  ggplot(aes(hpick, n, color = wday)) +
  geom_line(size = 1.5) +
  labs(x = "Hour of the day", y = "count")

layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1, p2, layout=layout)

We find:

  • January and June have fewer trips, whereas March and April are busier months. This tendency is observed for both vendor_ids.
  • The weekend (Sat and Sun, plus Fri to an extend) have higher trip numbers during the early morning ours but lower ones in the morning between 5 and 10, which can most likely be attributed to the contrast between NYC business days and weekend night life. In addition, trip numbers drop on a Sunday evening/night.
    Finally, we will look at a simple overview visualization of the pickup/dropoff latitudes and longitudes:
p1 <- train %>%
  filter(pickup_longitude > -74.05 & pickup_longitude < -73.7) %>%
  ggplot(aes(pickup_longitude)) +
  geom_histogram(fill = "red", bins = 40)

p2 <- train %>%
  filter(dropoff_longitude > -74.05 & dropoff_longitude < -73.7) %>%
  ggplot(aes(dropoff_longitude)) +
  geom_histogram(fill = "blue", bins = 40)

p3 <- train %>%
  filter(pickup_latitude > 40.6 & pickup_latitude < 40.9) %>%
  ggplot(aes(pickup_latitude)) +
  geom_histogram(fill = "red", bins = 40)

p4 <- train %>%
  filter(dropoff_latitude > 40.6 & dropoff_latitude < 40.9) %>%
  ggplot(aes(dropoff_latitude)) +
  geom_histogram(fill = "blue", bins = 40)

layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)

Here we had constrain the range of latitude and longitude values, because there are a few cases which are way outside the NYC boundaries. The resulting distributions are consistent with the focus on Manhattan that we had already seen on the map. These are the most extreme values from the pickup_latitude feature:

train %>%
  arrange(pickup_latitude) %>%
  select(pickup_latitude, pickup_longitude) %>%
  head(5)
## # A tibble: 5 x 2
##   pickup_latitude pickup_longitude
##             <dbl>            <dbl>
## 1            34.4            -65.8
## 2            34.7            -75.4
## 3            35.1            -71.8
## 4            35.3            -72.1
## 5            36.0            -77.4
train %>%
  arrange(desc(pickup_latitude)) %>%
  select(pickup_latitude, pickup_longitude) %>%
  head(5)
## # A tibble: 5 x 2
##   pickup_latitude pickup_longitude
##             <dbl>            <dbl>
## 1            51.9            -72.8
## 2            44.4            -67.0
## 3            43.9            -71.9
## 4            43.5            -74.2
## 5            43.1            -72.6

We need to keep the existence of these (rather astonishing) values in mind so that they don’t bias our analysis.